Parrotpark: Why and How to self-host LLMs
Jonas Stettner | CorrelAid @ CDL
2025-05-07
Agenda
Why self-hosting?
How to self-host? (Comparison of options)
Introduction to Parrotpark
Demonstration
Discussion
Why to self-host: Use of proprietary LLM applications
Dependence, lack of transparency and little control:
Data processing (GDPR)
Resource consumption
Properties and training of the models
Model and tool usage/configuration; e.g. web search (🗲 GUI apps such as GPT Builder)
Alternative: Self-Hosting
Chat interface and API bridge trivial to self-host - ✅ Model and tool usage/configuration
LLM inference:
Azure OpenAI on EU servers - ✅ GDPR
Open models
- ✅ More transparent models
API services hosted in the EU
Dedicated GPU server - ✅ Fully transparent resource consumption (only inference)
Dedicated vs API: Costs for EU Provider Scaleway
Claude 4 Opus on OpenRouter: $15/M input tokens; $75/M output tokens
GPT-4o on OpenRouter: $2.50/M input tokens; $10/M output tokens
Dedicated vs API: Costs for EU Provider Scaleway
How much VRAM can we afford?
NVIDIA L4 with 24 GB, which limits model choice and context window size
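A quick back-of-envelope check of whether a model fits in 24 GB. This is a generic rule of thumb (weights at the quantized bit width, plus a flat allowance for KV cache and runtime overhead), not a figure from any specific inference server; the 2 GB overhead is an assumption.

```python
def vram_gb(params_b: float, bits_per_weight: float, overhead_gb: float = 2.0) -> float:
    """Rough VRAM estimate: weights only, plus a flat allowance
    for KV cache and runtime overhead (assumed, not measured)."""
    weights_gb = params_b * bits_per_weight / 8  # params in billions -> GB
    return weights_gb + overhead_gb

# A 24B-parameter model at 4-bit quantization:
print(vram_gb(24, 4))  # -> 14.0, comfortably inside 24 GB
# The same model at 16-bit would need ~50 GB and not fit.
```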
Dedicated vs API: Example Pricing Calculation
Scaleway allows automated GPU instance creation (unlike Hetzner), so we deploy only during working hours
\(\text{Cost} = \text{€}0.75 \times (10\,\text{h} \times 5\,\text{days} \times 4\,\text{weeks}) = \text{€}150\)
Including tax (19%):
€178.50
Mistral Small 3.2 24B via OpenRouter (assuming 50/50 input/output split):
€178.50 = $210.27 (at €1 = $1.178)
\(\frac{\text{\$}105.14}{0.05} + \frac{\text{\$}105.14}{0.10} \approx 3{,}154\,\text{M tokens}\)
Per working day:
\(\frac{3{,}154}{20} \approx 158\,\text{M tokens/day}\)
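The whole comparison in one place, using only the figures from these slides (€0.75/h GPU rate, 19% VAT, the assumed €1 = $1.178 rate, and $0.05/$0.10 per M tokens for Mistral Small on OpenRouter):

```python
# Reproduces the slide's cost comparison; all prices are the slide's assumptions.
GPU_EUR_PER_H = 0.75
hours = 10 * 5 * 4                 # 10 h/day, 5 days/week, 4 weeks
gpu_eur = GPU_EUR_PER_H * hours    # 150.0 net
gpu_eur_gross = gpu_eur * 1.19     # 178.50 incl. 19% VAT
usd = gpu_eur_gross * 1.178        # ~210.27 at the assumed exchange rate

# Same budget spent on API tokens, 50/50 input/output split:
m_tokens = (usd / 2) / 0.05 + (usd / 2) / 0.10
print(round(gpu_eur_gross, 2), round(m_tokens))  # -> 178.5 3154
```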
Dedicated vs API: Why Dedicated?
Maximum control and transparency
More predictable/fixed costs
Additional workloads fit on the GPU server:
Embedding and reranking models
Frontend and API bridge
Exact metrics on hardware and inference server level
What is Parrotpark?🦜
Infrastructure project: Self-hosting of LLMs, LLM APIs and frontends
Defined conditions
Servers in the EU with as much GPU as NPOs could afford (20GB)
Hosting of open models
Output in German language
Research and development: How far can you get under the conditions, which applications can be implemented?
High Level Overview
Implementation
IaC (Infrastructure as Code): Replicable infrastructure through declaration as code
Quantization of smaller models
Use of existing open source projects
LLM inference server
Frontends for interaction with LLMs
So far no fine-tuning, but instead prompt optimization for certain applications
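One reason existing open-source frontends and bridges plug in so easily: most open inference servers (e.g. vLLM, Ollama, llama.cpp's server) expose an OpenAI-compatible HTTP API. A minimal sketch of building such a request; the base URL and model name are placeholders, not Parrotpark's actual endpoints.

```python
import json

BASE_URL = "http://localhost:8000/v1"  # placeholder, not a real endpoint


def chat_request(model: str, prompt: str) -> dict:
    """Build an OpenAI-compatible chat completion payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }


payload = chat_request("mistral-small", "Was ist IaC?")
print(json.dumps(payload))
# POST this to f"{BASE_URL}/chat/completions" with any HTTP client.
```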
Evaluation
Time window: June 17th to June 27th (9 working days)
Scraped Metrics: http://mtbs.correlaid.org/public/dashboard/6032e4e9-e87a-49d7-bd67-f0d92552cc1c
User Survey
Evaluation: Tokens and Pricing
Total processed tokens: 329,503 input / 103,083 output
❌ An API service for the same model would have cost far less:
$0.027 vs ~€178.50/2 ≈ €89
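Cross-checking the $0.027 figure from the measured token counts, using the same assumed per-token prices as in the earlier calculation ($0.05 input / $0.10 output per M tokens):

```python
# Cross-check of the API cost for the evaluation period's actual usage.
input_tokens = 329_503
output_tokens = 103_083

# Assumed OpenRouter prices for Mistral Small, in $ per million tokens:
cost_usd = input_tokens / 1e6 * 0.05 + output_tokens / 1e6 * 0.10
print(round(cost_usd, 3))  # -> 0.027
```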
Demonstration
GUI
Measuring power consumption